Lab1:
Evaluating PROGRESA

Edward Vytlacil

Conditional Cash Transfer Programs

  • Government transfer of money
    • to households in extreme poverty,
    • conditional on household’s children being enrolled in school.
  • Goals:
    • reduce child labor,
    • increase human capital investments in children,
    • break “cycle of poverty.”

PROGRESA

  • Early CCT Program in Mexico.

  • Implemented as part of a RCT in 1997.

  • Experimental results were very positive.

  • See, e.g., Skoufias et al. (2001) (link)

PROGRESA

  • Experimental results were very positive.

  • Led to rapid expansion in Mexico:

    • covered 2.6 million Mexican households in 50,000 villages by 2001,
    • renamed “Oportunidades” in 2002.
  • Led to similar programs in Bangladesh, Brazil, Cambodia, Chile, Colombi, Egypt, Guatemala, Honduras, Indonesia, Jamaica, Nicaragua, Panama, Peru, Phillipines, Turkey and the United States

PROGRESA Eligibility Requirement

  • Extremely poor households,
  • living within poor, rural villages.

PROGRESA Treatment: Cash transfer

  • to mother of the household,

    • Thomas (1990) shows in Brazilian data that, when more money is controlled by mothers, more money is spent on the children, especially on girls.

PROGRESA Treatment: Cash transfer

  • to mother of the household,

  • conditional on school enrollment between 3rd and 11th grades for children less than 18 years old,

  • higher for

    • girls than for boys from 7th-9th grade,
    • 9th and later grade than for earlier grades.

RCT to Evaluate PROGRESA

Of 506 poor, rural villages:

  • Two-thirds were randomly assigned to be treated villages,
    • All eligible (sufficiently poor) households in treated villages were treated.
  • The remaining villages were randomly assigned to be control villages,
    • No households in control villages were treated.

RCT: 5 Survey Waves

  • Two baseline (pre-treatment) survey waves:
    • October 1997 and March 1998;
  • Three survey waves after treatment began:
    • October 1998, May 1999, and November 1999.

Analyze PROGRESA Data Using R

Libraries

Start by loading the following libraries. Install them first if you have not already done so.

  • stargazer for creating tables, by Hlavac (2022)
  • ggplot2 for making figures, by Wickham (2016)
  • plotly for making figures interactive, by Sievert (2020)
Code
library(stargazer)
library(ggplot2)
library(plotly)

Read in the data, examine the data

  • Set working directory,
  • Use read.csv to read the downloaded data into a data.frame.
setwd("~/Dropbox/quarto_fun/Labs/PROGRESA") #my path, modify for your computer
df <- read.csv("PROGRESA.csv", header=TRUE, sep=",") # read csv

Examine the data

  • Examine data, make sure we understand variables, data structure, and look for any issues with the data.
  • Explore the data with R functions such as names, dim, head, table and summary.
names(df)
 [1] "wave"         "sooloca"      "indivill_seq" "villsize_t"   "progresa1"   
 [6] "sooind_id"    "sex1"         "age1"         "hgc1"         "poor1"       
[11] "school"      
dim(df)
[1] 74031    11

Examine the data

  • Explore the data with R functions such as names, dim, head, table and summary.
head(df)
  wave sooloca indivill_seq villsize_t progresa1 sooind_id sex1 age1 hgc1 poor1
1    1       1            1         76         1        18    0   10    1     1
2    1       1            2         76         1         5    0   11    2     1
3    1       1            3         76         1        30    0    9    2     1
4    1       1            4         76         1        42    1   14    3     1
5    1       1            5         76         1        14    1    7    0     1
6    1       1            6         76         1         8    0   12    2     1
  school
1      1
2      1
3      1
4      0
5      1
6      1

Examine the data

  • Explore the data with R functions such as names, dim, head, table and summary.
table(df$wave)

    1     2     3     4     5 
14996 12368 15455 15610 15602 
table(df$progresa1)

    0     1 
27723 46308 
table(df$hgc1)

    0     1     2     3     4     5     6     7     8     9 
 4669  8863  9685  9876  8926  8187 13296  3858  3049  3622 

Examine the data

  • Explore the data with R functions such as names, dim, head, table and summary.
summary(df)
      wave         sooloca     indivill_seq      villsize_t    
 Min.   :1.00   Min.   :  1   Min.   :  1.00   Min.   :  1.00  
 1st Qu.:2.00   1st Qu.:131   1st Qu.:  8.00   1st Qu.: 26.00  
 Median :3.00   Median :253   Median : 18.00   Median : 39.00  
 Mean   :3.06   Mean   :254   Mean   : 25.63   Mean   : 50.26  
 3rd Qu.:4.00   3rd Qu.:395   3rd Qu.: 34.00   3rd Qu.: 63.00  
 Max.   :5.00   Max.   :491   Max.   :210.00   Max.   :210.00  
   progresa1        sooind_id          sex1            age1      
 Min.   :0.0000   Min.   :    1   Min.   :0.000   Min.   : 0.00  
 1st Qu.:0.0000   1st Qu.: 3983   1st Qu.:0.000   1st Qu.: 9.00  
 Median :1.0000   Median : 7894   Median :0.000   Median :11.00  
 Mean   :0.6255   Mean   : 7877   Mean   :0.485   Mean   :11.33  
 3rd Qu.:1.0000   3rd Qu.:11768   3rd Qu.:1.000   3rd Qu.:14.00  
 Max.   :1.0000   Max.   :15669   Max.   :1.000   Max.   :99.00  
      hgc1           poor1       school      
 Min.   :0.000   Min.   :1   Min.   :0.0000  
 1st Qu.:2.000   1st Qu.:1   1st Qu.:1.0000  
 Median :4.000   Median :1   Median :1.0000  
 Mean   :4.029   Mean   :1   Mean   :0.8275  
 3rd Qu.:6.000   3rd Qu.:1   3rd Qu.:1.0000  
 Max.   :9.000   Max.   :1   Max.   :1.0000  

Examine the data

  • poor==1 for all observations, so that all observations are poor enough to be eligible for PROGRESA.
  • The youngest child is 0 years old, while the oldest child is 99 years old.
    • Possibly data entry error? such data problems very common.

Examine Unusual Age Observations

  • Investigating age distribution further. . .
sum(df$age1<6)
[1] 1
sum(df$age1>=6 & df$age1<=18)
[1] 73769
sum(df$age1>18)
[1] 261
  • While vast majority of rows have an age between 6 and 18,
    • 1 row has child less than 6,
    • 261 rows have child older than 18.

Examine Unusual Age Observations

Lets investigate the child with the reported age of 0.

df[df$age1==0,"sooind_id"]
[1] 1586
df[df$sooind_id==1586,c("sooind_id","wave","age1","hgc1")]
      sooind_id wave age1 hgc1
1418       1586    1    7    1
28916      1586    3   99    1
44394      1586    4    0    1
60001      1586    5    7    1

Examine Unusual Age Observations

  • The extreme ages appear to be data entry errors.

  • We could try to impute them, for example, impute that the previous child is age \(7\) in waves \(3\) and \(4\) based on his ages in waves \(1\) and \(5\).

  • Instead, we will set to missing.

Number of Observations

  • Data has 74031 rows (child-wave observations).
nrow(df)
[1] 74031
  • We do not have 74031 independent observations.

  • How many child-observations are in the data?

Number of Observations

  • Data has 74031 rows (child-wave observations).

  • How many child-observations are in the data?

    • Child id is sooind_id.

    • Use length and unique functions to find that there are 15669 child-observations in the data, of which 9799 are treated and 5870 are control.

length(unique(df$sooind_id))  
[1] 15669
length(unique(df[df$progresa1==1,"sooind_id"]))
[1] 9799
length(unique(df[df$progresa1==0,"sooind_id"]))
[1] 5870

Number of Observations

  • Data has 74031 rows (child-wave observations).

  • There are 15669 child-observations in the data, of which 9799 are treated and 5870 are control.

    • Child within the same village (and certainly in same family) may not be independent observations.
  • How many villages?

Number of Observations

  • Data has 74031 rows (child-wave observations).

  • There are 15669 child-observations in the data, of which 9799 are treated and 5870 are control.

  • How many villages?

    • Village id is sooloca.

    • Use length and unique functions to find that there are 491 villages, of which 308 are treated and 183 are control.

length(unique(df$sooloca))
[1] 491
length(unique(df[df$progresa1==1,"sooloca"]))
[1] 308
length(unique(df[df$progresa1==0,"sooloca"]))
[1] 183

Preparing data for analysis

  • Change name of variables for convenience:
    • progresa1 to treat,
    • sex1 to girl .
  • Using the subset function, restrict data to second wave (last baseline wave) and fifth wave (final post-treatment wave) and keep only relevant variables.
df$treat <- df$progresa1 
df$girl <- df$sex1 
df<-subset(df, wave==5 | wave==2,select= c(sooloca,sooind_id,age1,hgc1,school,treat,girl,wave))

Preparing data for analysis

  • Using the ifelse command, code as missing any age less than 6 or greater than 18.

  • Using the ifelse command, create an indicator variable for whether an observation is from the post period, i.e., a variable that equals 1 if the observation is from waves 5.

  • Create seperate data frames for pre and post treatment observations.

df$age1 <- ifelse(df$age1>=6 & df$age1<=18, df$age1, NA)  
df$post <- ifelse(df$wave==5, 1, 0)  
dfPre <- df[df$post==0,]  
dfPost <- df[df$post==1,]  

Preparing data for analysis

  • For more involved “data wrangling” with R, see either:
    • R Workflow by Frank Harrell, approach based on HMISC.
    • R for Data Science by Hadley Wickham and Garret Grolemund, approach based on tidyverse.

Balance at Baseline: Summary Table

  • Produce table examining balance in covariates between treated and control at baseline.

  • By randomization, no systematic differences in expectation.

  • However, some differences could arise by random chance.

Balance at Baseline
Control Treated Diff. Std Diff
Girl 0.50 0.48 -0.02 -0.04
Age 10.51 10.49 -0.02 -0.01
Highest Grade 3.26 3.28 0.02 0.01
Enrolled 0.87 0.87 0.002 0.01

Balance at Baseline: Summary Table

  • Standard to report table examining balance in baseline characteristics.
    • often including t-tests of null of equal means.
    • See Bruhn and McKenzie (2009) Section G.
Balance at Baseline
Control Treated Diff. Std Diff
Girl 0.50 0.48 -0.02 -0.04
Age 10.51 10.49 -0.02 -0.01
Highest Grade 3.26 3.28 0.02 0.01
Enrolled 0.87 0.87 0.002 0.01

Balance at Baseline: Summary Table

  • Researchers sometimes re-randomize if baseline not sufficiently balanced.
    • Does typically improve precision (similar to proper linear regression adjustment for baseline characteristics)
    • Invalidates conventional standard errors, inference.
Balance at Baseline
Control Treated Diff. Std Diff
Girl 0.50 0.48 -0.02 -0.04
Age 10.51 10.49 -0.02 -0.01
Highest Grade 3.26 3.28 0.02 0.01
Enrolled 0.87 0.87 0.002 0.01
  • See, e.g., Li, Ding, and Rubin (2018), Lin (2013), and Negi and Wooldridge (2021).

  • Rerandomization is inferior to other methods to design the experiment or to adjust for differences at baseline. See, e.g., Bai (2022)

Balance at Baseline: Summary Table

  • Code to produce previous table:
return_mean_by_treatment <- function(x){
  means.t<-tapply(x,dfPre$treat,mean)
  var.t<-tapply(x,dfPre$treat,var)
  return(c(means.t,var.t))
  }

vars <- c("girl","age1","hgc1","school")
output <- sapply(dfPre[vars],return_mean_by_treatment)

means <- output[1:2,]
diffs <- output[2,]-output[1,]
N1 <- sum(dfPre$treat)
N0 <- sum(1-dfPre$treat)
pooled.sd <- sqrt(((N0-1)*output[3,]+(N1-1)*output[4,])/(N0+N1-2))
std.diffs <- (output[2,]-output[1,])/pooled.sd 
results0 <- t(rbind(means,diffs,std.diffs))

colnames(results0)<-c("Control","Treated","Diff.","Std Diff")
varlabels <- c("Girl","Age","Highest Grade","Enrolled")
rownames(results0)<-c(varlabels)

stargazer(results0,  type="html", digits=2, title="Balance at Baseline")

Baseline: Enrollment by Grade, Sex

Baseline: Enrollment by Grade, Sex

  • Code to produce figure for boys:
MeanSchC.b  <-  with(subset(dfPre, treat==0 & girl==0),tapply(school, hgc1, mean))
MeanSchT.b <-  with(subset(dfPre, treat==1 & girl==0),tapply(school, hgc1, mean))

Grade <- as.factor(rep(c(1:10),2))
Group <- c(rep(" Control", 10), rep("Treated", 10))
MeanSch.b <- matrix(c(MeanSchC.b,MeanSchT.b), nrow = 20, ncol = 1)
tab.b <- data.frame(Grade, Group, MeanSch.b)
 
plotPre.b <- ggplot(tab.b, aes(x = Grade, y = MeanSch.b, fill = Group)) +
  geom_col(width = 0.7, position = position_dodge(width=0.8)) +
  theme_bw(base_size = 11) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  scale_y_continuous(breaks = seq(from = 0, to = 1, by = 0.1)) +
  xlab("Grade Level") +
  ylab("Mean School Enrollment")+
  ggtitle("Boys: School Enrollment by Grade, Pre Period")
 
ggplotly(plotPre.b,tooltip="y")  

Using Post Data to Estimate Effect

Using mean differences on post-treatment data to estimate effect of PROGRESA on school enrollment:

  • Overall: 0.041,

  • By sex:

    • For boys: 0.022,

    • For girls: 0.06.

mean(dfPost[ dfPost$treat==1,"school"]) -  mean(dfPost[ dfPost$treat==0,"school"])
# 0.04063189
mean(dfPost[dfPost$treat==1&dfPost$girl==0,"school"])-mean(dfPost[dfPost$treat==0&dfPost$girl==0,"school"])
# 0.02215276
mean(dfPost[dfPost$treat==1&dfPost$girl==1,"school"])- mean(dfPost[dfPost$treat==0&dfPost$girl==1,"school"])
# 0.05975947

Using Post Data to Estimate Effect

Using mean differences on post-treatment data to estimate effect of PROGRESA on school enrollment:

  • Overall: 0.041,

  • By sex:

    • For boys: 0.022,

    • For girls: 0.06.

  • Mean difference estimator justified by randomization, no selection bias.

  • Can take mean difference conditional on any covariate not effected by the treatment, including any baseline characteristic.

Estimate Effect by Grade, Sex

Code
MeansC <- with(dfPost,tapply(school,list(treat,girl,hgc1),mean))

dimnames(MeansC)[[1]] <-c("Control","Treated")
dimnames(MeansC)[[2]] <-c("Boys","Girls")

EffSchBoy <- (MeansC[2,1,] - MeansC[1,1,]) 
EffSchGirl <- (MeansC[2,2,] - MeansC[1,2,]) 

Grade <- as.factor(c(1:10, 1:10))
EffSch <- matrix(c(EffSchBoy, EffSchGirl), nrow = 20, ncol = 1)
sex <- c(rep("Boy", 10), rep("Girl", 10))
tabEff <- data.frame(Grade, sex, EffSch)

plotEff <- ggplot(tabEff, aes(x = Grade, y = EffSch, fill = sex)) +
  geom_col(width = 0.7, position = position_dodge(width=0.8)) +
  theme_bw(base_size = 10) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  scale_y_continuous(breaks = seq(from = -0.1, to = 0.1, by = 0.05)) +
  xlab("Grade Level") +
  ylab("Mean School Enrollment") +
  ggtitle(" Treatment Effects Estimates by Grade and Sex")

ggplotly(plotEff,tooltip="y")

Estimate Effect by Grade, Sex

  • Some results surprising, e.g. estimated effect on boys in first grade is -0.0676692.

  • Random sampling noise?

    • With i.i.d. data, \(\mbox{Var}(\bar{X}_N)=\frac{\sigma^2}{N}\), where \(\sigma^2\) is the variance of each \(X_i\).

    • The higher the number of observations, the lower the sampling noise.

    • Should we think of sample size as number of children in a sex, treatment status, grade cell? number of villages in a cell?

No. of Obs. by Grade-Sex-Treatment

Code
person.counts <- with(dfPost,tapply(sooind_id,list(treat,girl,hgc1),function(x){length(unique(x))}))
dimnames(person.counts)[[1]] <-c("Control","Treated")
dimnames(person.counts)[[2]] <-c("Boys","Girls")
 
Grade <- as.factor(rep(c(1:10),2))
Group <- c(rep(" Control", 10), rep("Treated", 10))
numbers.boys <- matrix(c(person.counts[1,1,],person.counts[2,1,]), nrow = 20, ncol = 1)
numbers.girls<- matrix(c(person.counts[1,2,],person.counts[2,2,]), nrow = 20, ncol = 1)

tab.boys <- data.frame(Grade, Group, numbers.boys)
tab.girls <- data.frame(Grade, Group, numbers.girls)

plot.n.boys <- ggplot(tab.boys, aes(x = Grade, y = numbers.boys, fill = Group)) +
  geom_col(width = 0.7, position = position_dodge(width=0.8)) +
  theme_bw(base_size = 11) +  
  theme(legend.position = "bottom", legend.title = element_blank()) +
  scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) +
  xlab("Grade Level") +
  ylab("Number of Boy Observations")+
  ggtitle("Number of Boy Observations, Treated and Control")



plot.n.girls <- ggplot(tab.girls, aes(x = Grade, y = numbers.girls, fill = Group)) +
  geom_col(width = 0.7, position = position_dodge(width=0.8)) +
  theme_bw(base_size = 11) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
 scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) +
  xlab("Grade Level") +
  ylab("Number of Girl Observations")+
  ggtitle("Number of Girl Observations, Treated and Control")

  
village.counts <- with(dfPost,tapply(sooloca,list(treat,girl,hgc1),function(x){length(unique(x))}))
dimnames(village.counts)[[1]] <-c("Control","Treated")
dimnames(village.counts)[[2]] <-c("Boys","Girls")


numbers.boys.v <- matrix(c(village.counts[1,1,],village.counts[2,1,]), nrow = 20, ncol = 1)
numbers.girls.v<- matrix(c(village.counts[1,2,],village.counts[2,2,]), nrow = 20, ncol = 1)

tab.boys.v <- data.frame(Grade, Group, numbers.boys.v)
tab.girls.v <- data.frame(Grade, Group, numbers.girls.v)

plot.n.boys.v <- ggplot(tab.boys.v, aes(x = Grade, y = numbers.boys.v, fill = Group)) +
  geom_col(width = 0.7, position = position_dodge(width=0.8)) +
  theme_bw(base_size = 11) + 
  theme(legend.position = "bottom", legend.title = element_blank()) +
 scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) + 
  xlab("Grade Level") +
  ylab("Number of Villages with Boy Observations")+
  ggtitle("Number of Villages with Boy Observations")



plot.n.girls.v <- ggplot(tab.girls.v, aes(x = Grade, y = numbers.girls.v, fill = Group)) +
  geom_col(width = 0.7, position = position_dodge(width=0.8)) +
  theme_bw(base_size = 11) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
 scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) + 
  xlab("Grade Level") +
  ylab("Number of Villages with Girl Observations")+
  ggtitle("Number of Villages with Girl Observations")

Results?

  • Estimated effects:

    • positive overall;

    • larger for girls than for boys;

    • vary widely across sex-grade cells.

  • To what degree are these results real, versus driven by sampling noise?

  • Next time: hypothesis testing!

References

Bai, Yuehao. 2022. “Optimality of Matched-Pair Designs in Randomized Controlled Trials.” arXiv Preprint arXiv:2206.07845.
Bruhn, Miriam, and David McKenzie. 2009. “In Pursuit of Balance: Randomization in Practice in Development Field Experiments.” American Economic Journal: Applied Economics 1 (4): 200–232. https://www.jstor.org/stable/25760187.
Grolemund, Garrett. 2014. Hands-on Programming with r: Write Your Own Functions and Simulations. " O’Reilly Media, Inc.".
Hlavac, Marek. 2022. “Stargazer: LaTeX Code and ASCII Text for Well-Formatted Regression and Summary Statistics Tables” R package version 5.2.3. https://CRAN.R-project.org/package=stargazer.
(IFPRI), International Food Policy Research Institute. 2018. Mexico, Evaluation of PROGRESA.” Harvard Dataverse. https://doi.org/10.7910/DVN/05BMJY.
Lee, Soohyung, and Azeem M Shaikh. 2014. “Multiple Testing and Heterogeneous Treatment Effects: Re-Evaluating the Effect of Progresa on School Enrollment.” Journal of Applied Econometrics 29 (4): 612–26. https://home.uchicago.edu/~amshaikh/webfiles/progresa.pdf.
Li, Xinran, Peng Ding, and Donald B Rubin. 2018. “Asymptotic Theory of Rerandomization in Treatment–Control Experiments.” Proceedings of the National Academy of Sciences 115 (37): 9157–62.
Lin, Winston. 2013. “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman’s Critique.” The Annals of Applied Statistics 7 (1): 295–318.
Negi, Akanksha, and Jeffrey M Wooldridge. 2021. “Revisiting Regression Adjustment in Experiments with Heterogeneous Treatment Effects.” Econometric Reviews 40 (5): 504–34.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with r, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.
Skoufias, Emmanuel, Susan W Parker, Jere R Behrman, and Carola Pessino. 2001. “Conditional Cash Transfers and Their Impact on Child Work and Schooling: Evidence from the Progresa Program in Mexico [with Comments].” Economia 2 (1): 45–96. https://www.jstor.org/stable/20065413.
Thomas, Duncan. 1990. “Intra-Household Resource Allocation: An Inferential Approach.” Journal of Human Resources, 635–64.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.